import urllib.error
import urllib.request


class Alphabet():
    """ A minimal class for alphabets
    Alphabets include DNA, RNA and Protein """

    def __init__(self, symbolString):
        self.symbols = symbolString

    def __len__(self): # implements the "len" operator, e.g. "len(Alphabet('XYZ'))" results in 3
        return len(self.symbols) # will tell you the length of the symbols in an Alphabet instance

    def __contains__(self, sym): # implements the "in" operator, e.g. "'A' in Alphabet('ACGT')" results in True
        return sym in self.symbols # will tell you if 'A' is in the symbols in an Alphabet instance

    def __iter__(self): # method that allows us to iterate over all symbols, e.g. "for sym in Alphabet('ACGT'): print(sym)" prints A, C, G and T on separate lines
        tsyms = tuple(self.symbols)
        return tsyms.__iter__()

    def __getitem__(self, ndx):
        """ Retrieve the symbol(s) at the specified index (or slice of indices) """
        return self.symbols[ndx]

    def index(self, sym):
        """ Retrieve the index of the given symbol in the alphabet. """
        return self.symbols.index(sym)

    def __str__(self):
        return self.symbols
""" Below we declare alphabet variables that are going to be available when
this module (this .py file) is imported """
DNA_Alphabet = Alphabet('ACGT')
RNA_Alphabet = Alphabet('ACGU')
Protein_Alphabet = Alphabet('ACDEFGHIKLMNPQRSTVWY')
Protein_wX = Alphabet('ACDEFGHIKLMNPQRSTVWYX')
Protein_wGAP = Alphabet('ACDEFGHIKLMNPQRSTVWY-')
def getSequence(entryId, dbName = 'uniprotkb', alphabet = Protein_Alphabet, format = 'fasta', debug: bool = True):
    """ Retrieve a single entry from a database
    entryId: ID for entry e.g. 'P63166' or 'SUMO1_MOUSE'
    dbName: name of database e.g. 'uniprotkb' or 'pdb' or 'refseqn'; see http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases for available databases
    alphabet: alphabet passed to readFastaString when format is 'fasta'
    format: file format specific to database e.g. 'fasta' or 'uniprot' for uniprotkb (see http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases)
    See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp for more info re URL syntax
    """
    if not isinstance(entryId, str):
        entryId = entryId.decode("utf-8")
    url = 'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?style=raw&db=' + dbName + '&format=' + format + '&id=' + entryId
    try:
        if debug:
            print('DEBUG: Querying URL: {0}'.format(url))
        data = urllib.request.urlopen(url).read()
        if format == 'fasta':
            # readFastaString is not shown on this page
            return readFastaString(data.decode("utf-8"), alphabet)[0]
        else:
            return data.decode("utf-8")
    except urllib.error.HTTPError as ex:
        raise RuntimeError(ex.read())
# get the covid 19 genome (29kB)
covid_seq = getSequence('MN908947', 'genbank', DNA_Alphabet)
# print(seq_no10)
# get all the bases from covid_seq
# translate all of those bases into amino acid seq
# in all reading frames (6)
tr_f = [0, 1, 2]
# translate protein
covid_AAseq = []

for i in tr_f:
    #print("all ORF in fwd direction", covid_seq.translateDNA(i, True))
    seq10b = covid_seq.translateDNA(i, True)
    # note: protseq is reassigned on every pass of this loop
    protseq = seq10b.split("*")
    seq10br = covid_seq.translateDNA(i, False)
    protseq.extend(seq10br.split("*"))
print(str(protseq))
# print(protseq)
# for element in protseq:
#     print("Individual value is",element)
# #for i in tr_f:
#     print("all ORF in reverse direction", covid_seq.translateDNA(i, False))
#NEXT STEP (for mapping the ORF that begins with M & calculate the len > 100)
ORF = []
for i in protseq:
    if i.startswith('M'):
        ORF.append(i)
print('all of potential ORF', ORF)
print('length of all potential ORF')
for i in ORF:
    print(i, ":", len(i))
true_ORF = []
for i in ORF:
    if len(i) > 100:
        true_ORF.append(i)
print(true_ORF)
# for i in protseq:
#     if i == 'M': # and len(i) > 100:
#         ORF.append(i)
# print(ORF)
2023 Sep 7th – UQ PUG 1
Welcome to UQ Python User Group! Check out our general information for details about who we are and what we do.
Structure
- We will start today by having everyone add their names to this page.
- Add your questions to this page.
- This Month’s Presentation.
- Finally we will spend the rest of the session answering the questions you have brought!
This month’s presentation
Welcome to our first Python User Group gathering! This month, Luke and Cameron give an overview of the group, our vision and Noteable, the interactive collaborative notebook platform which can run markdown, Python, R and SQL.
Introduce yourself
What’s your name? | Where are you from? | Why are you here? |
---|---|---|
Luke | Library | Learn |
Nick | Library | Learn |
Valentina Urrutia Guada | Library | Learn |
Nida | School of Molecular Biology | (1) Understand the logic (in some code) for bioinformatics and metagenomics applications, (2) Learn how to visualize data in Python that is hard to do in Excel |
Cameron | Library | Learn |
Sam Hames | School of Languages and Cultures | Community |
Nikhil | School of EECS | Learn |
Paul Vrbik | EECS | Support |
Jason Dail | SENV | Learn |
Annie Nguyen | SENV | Learn :) |
Research tools
Here are a few links we shared around, mostly from Jason.
https://researchrabbitapp.com/home helpful when making research collections, mapping concepts, and looking at the linkages between references
https://consensus.app/
https://chat.openai.com/
https://article-summarizer.scholarcy.com/summarizer
https://typeset.io/ AI assistant for reading and understanding papers
https://www.listendata.com/2023/03/how-to-run-chatgpt-inside-excel.html (how-to for an Excel extension for ChatGPT)
EECS tutor help: https://eecs.uq.edu.au/current-students/eecs-learning-centre-tutors
Questions
If you have any Python questions you’d like to explore with the group, please put them in a markdown cell, with any code you’d like us to run in a Python cell.
Question 1 - Finding substrings for COVID sequencing - Note that the formatting has not transferred correctly here Nida
Nida has a problem where she needs to identify specific substrings in a large sequence of characters (a DNA sequence). Her code is below.
I just found out that the code:
covid_seq = getSequence('MN908947', 'genbank', DNA_Alphabet)
fetches a DNA sequence of COVID from GenBank and returns a string containing 29,900 characters (which we can see by opening the EBI URL below).
It uses a function called "getSequence"
with code:
`def getSequence(entryId, dbName = 'uniprotkb', alphabet = Protein_Alphabet, format = 'fasta', debug: bool = True):
    if not isinstance(entryId, str):
        entryId = entryId.decode("utf-8")
    url = 'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?style=raw&db=' + dbName + '&format=' + format + '&id=' + entryId
    try:
        if debug:
            print('DEBUG: Querying URL: {0}'.format(url))
        data = urllib.request.urlopen(url).read()
        if format == 'fasta':
            return readFastaString(data.decode("utf-8"), alphabet)[0]
        else:
            return data.decode("utf-8")
    except urllib.error.HTTPError as ex:
        raise RuntimeError(ex.read())`
This function retrieves a single entry from a database (entryId: ID for entry e.g. 'MN908947', dbName: name of database e.g. 'genbank').
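To see what getSequence actually asks the EBI server for, it can help to build the query URL on its own without making a request. The sketch below is only an illustration: `build_dbfetch_url` and `DBFETCH_BASE` are hypothetical names that are not part of the original module, and the parameters are the ones that appear in the URL string in the question.

```python
from urllib.parse import urlencode

# Base address copied from the getSequence function in the question.
DBFETCH_BASE = 'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch'


def build_dbfetch_url(entry_id, db_name='uniprotkb', fmt='fasta'):
    """Return the query URL for one entry (hypothetical helper, no network access)."""
    query = urlencode({'style': 'raw', 'db': db_name, 'format': fmt, 'id': entry_id})
    return DBFETCH_BASE + '?' + query


# The same arguments as the covid_seq call in the question.
print(build_dbfetch_url('MN908947', 'genbank'))
```

Pasting the printed URL into a browser shows the raw FASTA text that getSequence would go on to parse.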
Once we have that DNA string (consisting of only 4 types of characters: A/C/T/G), a biologist will translate it into an amino acid (protein) string, consisting of 20 types of characters plus the asterisk (*); please see https://www.hgvs.org/mutnomen/codon.html. We translate it using a dictionary (code not shown) and then split the result on * to produce a list of strings stored in a variable called “protseq” (see https://docs.google.com/document/d/1R22IGMfe9i1tYAlPK5ZSOikV-ON6xZVUq-C87xDMiLg/edit?usp=sharing ).
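Since the translation dictionary is not shown, here is a minimal sketch of what that step might look like. It is not the translateDNA method from the original module: `CODON_TABLE`, `translate` and `toy_dna` are names made up for this example, the table is the standard genetic code, and the input is assumed to be upper-case A/C/G/T only.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame=0):
    """Translate dna starting at offset `frame`, ignoring an incomplete final codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return ''.join(CODON_TABLE[codon] for codon in codons)


toy_dna = 'ATGGCCTAAATGTGA'   # made-up sequence, not the COVID genome
protein = translate(toy_dna)  # stop codons become '*'
pieces = protein.split('*')   # the same splitting step that produces protseq
print(protein)
print(pieces)
```

Note that splitting on `*` removes the stop symbols themselves, and that a string ending in `*` leaves an empty final piece.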
Finally, from that I need to find how many strings in “protseq” meet these criteria: 1) start with M, 2) end with *, 3) have a length of >= 100 characters.
So my question is: how do I understand the logic behind the code below, which is said to be able to do that job?
# check the first occurrence of M in the first string of the 'protseq' list
where_M_in_protseq_1st_string = protseq[0].find('M')
print(where_M_in_protseq_1st_string)
print(len(protseq[0]))
# count the strings in 'protseq' with at least 100 characters from the first M to the end
# (str.find returns -1 when a string has no M, which this test does not exclude)
cnt = 0
for i in protseq:
    if len(i) - i.find("M") >= 100:
        cnt += 1
print(cnt)
# print each sufficiently long substring running from the first M to the end
for seq in protseq:
    m_pos = seq.find("M")
    m_end_seq = seq[m_pos:]
    if len(m_end_seq) > 100:
        print(m_end_seq)
        print(len(m_end_seq))
The answer for this problem should be that 8 of the 18 strings in the protseq list meet the criteria (so if we end up getting 8 from that code, we are correct). I just hope that I can get it right and understand how the code works.
Importance: from those 8 strings, we can try one of them in a real protein database (called UniProtKB) and find out which part of the COVID virus is likely to interact with humans and cause disease.
Thank you very much.
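One way to read the logic in the question: `find("M")` gives the position of the first M in a string, slicing from that position keeps everything from that M to the end, and the length of the slice is what gets compared with the threshold. The sketch below walks through this on made-up data. `find_orfs` and `toy_protseq` are hypothetical names, and the threshold is lowered to 5 so the toy strings are easy to check by hand. It also skips strings with no M, because `str.find` returns -1 in that case and `len(s) - (-1)` would otherwise pass the length test.

```python
def find_orfs(pieces, min_length=100):
    """Return the part of each piece from its first 'M' onwards, if long enough."""
    orfs = []
    for piece in pieces:
        m_pos = piece.find('M')
        if m_pos == -1:  # no 'M' in this piece, so it cannot contain a candidate
            continue
        candidate = piece[m_pos:]  # from the first 'M' to the end of the piece
        if len(candidate) >= min_length:
            orfs.append(candidate)
    return orfs


toy_protseq = ['KLMAAAA', 'QQQQQQQQ', 'MKV', 'AAMGGGGGG']  # made-up pieces

# Length test without the -1 guard: 'QQQQQQQQ' is counted although it has no 'M'.
naive_count = sum(1 for p in toy_protseq if len(p) - p.find('M') >= 5)
print(naive_count)

print(find_orfs(toy_protseq, min_length=5))
```

Because the pieces come from splitting on `*`, every piece except the last one in the translated string was followed by a stop, which is how the "end with *" criterion is covered.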
characters = "asjidowgeriogpjicnjjlaksdjalksdj*alskjdjjjjeosj hjjjjl"
# We want to pick out "jid" and "jic"
ans = []
for k, c in enumerate(characters):
    if c == "j": # add arbitrary number of constraints
        ans.append(characters[k:k+3])
    if c == "*":
        break
# drop the last match if it was cut short by the end of the string
if len(ans[-1]) < 3:
    ans.pop()
print(ans)
characters = "asjidowgeriogpjicneosjl"
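The same idea can be wrapped in a function so that the start character, stop character and width are easy to change. This is only a sketch of one possible variant, not the code the group ran: `pick_after` and its parameters are hypothetical names, and unlike the loop above it discards every chunk that is too short or that runs into the stop character, not just the last one.

```python
def pick_after(characters, start='j', stop='*', width=3):
    """Collect width-long chunks beginning at each `start`, up to the first `stop`."""
    found = []
    for k, c in enumerate(characters):
        if c == stop:
            break  # ignore everything after the stop character
        if c == start:
            chunk = characters[k:k + width]
            # keep only full-width chunks that do not run into the stop character
            if len(chunk) == width and stop not in chunk:
                found.append(chunk)
    return found


print(pick_after("asjidowgeriogpjicneosjl"))
```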